ENH: better dtype inference when doing DataFrame reductions #52788
Conversation
37d9ac0 to ef89a57 (Compare)
The reason for the gymnastics/checking for keepdims in #52261 is because third-party EAs won't have implemented keepdims, so they will need a deprecation cycle to catch up (though in general I'd be OK with going less far out of our way with this kind of thing).
pandas/core/arrays/arrow/array.py (Outdated)

```diff
@@ -1310,6 +1312,12 @@ def pyarrow_meth(data, skip_nulls, **kwargs):
     if name == "median":
         # GH 52679: Use quantile instead of approximate_median; returns array
         result = result[0]
+
+    if keepdims:
+        # TODO: is there a way to do this without .as_py()
```
cc @jorisvandenbossche, suggestions? (Also, I expect you'll have thoughts on the big-picture approach in this PR.)
@jorisvandenbossche gentle ping, I don't want to move forward here without your OK.
This just converts a scalar, so performance-wise the effect of this is limited. It would be worse if a whole array were being converted.
Yes, using `as_py()` is the sensible thing to do here to convert a scalar to a len-1 array (until pyarrow supports its own scalars in `pa.array([..])`).
(We have internal functionality in C++ to convert a scalar to a len-1 array, but I'm not sure that is worth exposing in pyarrow.)
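For reference, a minimal illustration of the `as_py()` round-trip being discussed (assuming pyarrow is installed; the variable names are illustrative):

```python
import pyarrow as pa
import pyarrow.compute as pc

result = pc.sum(pa.chunked_array([[1, 2, 3]]))  # a pyarrow Int64Scalar
# At the time of this discussion, pa.array([...]) did not accept pyarrow
# scalars directly, so the scalar goes through a Python object first:
wrapped = pa.array([result.as_py()], type=result.type)  # length-1 int64 array
```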
53e6241 to 6c25d56 (Compare)
I've refactored to use …
This still requires downstream authors to implement something new; it just avoids issuing a warning to inform them about it.
Actually, it is backward compatible: if they don't implement this, their scalar just gets wrapped in a NumPy array, so it will work unchanged. They just don't get this improved type inference without implementing …
pandas/core/arrays/masked.py (Outdated)

```python
if self.dtype.kind == "f":
    np_dtype = "float64"
elif name in ["mean", "median", "var", "std", "skew"]:
    np_dtype = "float64"
elif self.dtype.kind == "i":
    np_dtype = "int64"
elif self.dtype.kind == "u":
    np_dtype = "uint64"
else:
    raise TypeError(self.dtype)
```
Should corr, sem, and kurt be part of the first elif? And is this not hit by e.g. a sum of Boolean?
In general I'm not a fan of encoding op-specific logic like this. It can lead to bugs if behavior is changed in one place but not the other.
Yeah, good points, I'll have to look into this again to see if I can do it differently.
Wow - nice! So this wasn't necessary?
Yeah, it was needed in the first draft; then I changed the code, making it unnecessary.
There are still some things missing in this PR. I'll update today.
pandas/tests/extension/base/dim2.py (Outdated)

```python
elif method in ["prod", "sum"]:
    kwargs["min_count"] = 1
```
Why does this need to change?
The input array (`arr2d`) contains NA values. Previously we were adding scalars, so e.g. `sum([pd.NA])` would give `pd.NA`, even with `min_count=0`.
In the new version, pandas will correctly give a result of 0 with `min_count=0`, while correctly giving `pd.NA` with `min_count=1`.
So this could be considered a bug fix.
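A minimal illustration of the `min_count` semantics described above (1-dim case):

```python
import pandas as pd

s = pd.Series([pd.NA], dtype="Int64")
print(s.sum(min_count=0))  # 0: with no minimum required, an all-NA sum defaults to 0
print(s.sum(min_count=1))  # <NA>: fewer than min_count valid values present
```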
Thanks; is there a test for `min_count=0`?
There isn't one currently for the 2-dim case, only the 1-dim case. I can add it, but that will have to be tomorrow.
I've updated this.
4e11481 to a1e338a (Compare)
I've updated it, so it now works correctly with pyarrow and categorical dtypes. Datetimelikes are still missing, likewise docs and maybe some additional tests. So now we can do (notice the dtype):

```python
>>> arr = pd.array([1, 2], dtype="int64[pyarrow]")
>>> df = pd.DataFrame({"a": arr, "b": arr})
>>> df.median()
a    1.5
b    1.5
dtype: double[pyarrow]
```

This PR has been rebased on top of #52890. Could be an idea to get that merged …
c1a1a39 to 5297cf3 (Compare)
6f76e2c to 346f043 (Compare)
pandas/core/arrays/masked.py (Outdated)

```python
def _simple_new(cls, values: np.ndarray, mask: npt.NDArray[np.bool_]) -> Self:
    result = BaseMaskedArray.__new__(cls)
    result._data = values
    result._mask = mask
    return result
```
This is just to save doing the isinstance check when calling `__init__` with `copy=False`?
This is from #53013 (I built this on top of that to see what performance improvements I get here), so it can be discussed independently in #53013.
And yes, this is to avoid the checks when they are not needed. For this PR I need `BaseMaskedArray.reshape` to be fast, and this helps with that (and many other cases).
pandas/tests/extension/test_arrow.py (Outdated)

```diff
@@ -507,6 +507,38 @@ def test_reduce_series(self, data, all_numeric_reductions, skipna, request):
             request.node.add_marker(xfail_mark)
         super().test_reduce_series(data, all_numeric_reductions, skipna)
+
+    def check_reduce_and_wrap(self, ser, op_name, skipna):
+        if op_name in ["count", "kurt", "sem", "skew"]:
+            pytest.skip(f"{op_name} not an array method")
```
Can you add an assert with `hasattr` to ensure the skip reason is valid?
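A sketch of the suggested guard (the exact placement within the test is assumed):

```python
import pytest

if op_name in ["count", "kurt", "sem", "skew"]:
    # verify the skip reason is actually true for this array type
    assert not hasattr(ser.array, op_name), f"{op_name} is an array method"
    pytest.skip(f"{op_name} not an array method")
```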
Added.
```diff
@@ -367,6 +372,30 @@ def check_reduce(self, s, op_name, skipna):
             expected = bool(expected)
         tm.assert_almost_equal(result, expected)
+
+    def check_reduce_and_wrap(self, ser: pd.Series, op_name: str, skipna: bool):
+        if op_name in ["count", "kurt", "sem"]:
+            pytest.skip(f"{op_name} not an array method")
```
Same
Added.
cmp_dtype = "boolean" | ||
elif op_name in ["sum", "prod"]: | ||
is_windows_or_32bit = is_platform_windows() or not IS64 | ||
cmp_dtype = "Int32" if skipna and is_windows_or_32bit else "Int64" |
I wouldn't have expected skipna to have an influence here; is this deliberate or an existing issue?
I've had a big struggle getting a handle on how the dtype ends up changing on different systems, so this is definitely not deliberate on my part; I'm just trying to get the tests to pass, and then it will be easier to fix up whatever needs to be fixed.
But I think you are probably right; I can look into it tomorrow. Having said that, I don't think I've changed anything in this regard, so if this is a bug, this just uncovers an existing bug and it shouldn't be something introduced in this PR.
I've fixed this, so skipna doesn't affect the dtype.
Also: "I don't think I've changed anything in this regard" is always famous last words. My PR was the cause of this, so good catch.
After the various perf PRs (#52998, #53013 & #53040), I now get unchanged perf for axis=0 when calling … That means that …

I've improved the … Generally, in the …
pandas/core/arrays/base.py (Outdated)

```python
        Length: 1, dtype: Int64
        """
        result = self._reduce(name, skipna=skipna, **kwargs)
        return np.array([[result]])
```
This looks like a 2D result. Should it be 1D?
This just moves the existing behavior from `pandas.core.internals.blocks.Block.reduce` to here. Having said that, I think you are right: 1D makes more sense here, and it passes all tests too, so I've changed it.
pandas/core/arrays/masked.py (Outdated)

```python
    return result

def _wrap_na_result(self, *, name, axis):
```
This feels like it got a lot more complicated than the first attempt at "keepdims". Does this just address more corner cases that the first attempt missed?
Yeah, this is not great, and I've tinkered quite a lot with other implementations. I think/hope I will have a simpler solution tonight.
OK, I looked into this, and unless I do a relatively major refactoring, it looks like I would just move complexity around by changing this. So without a bigger rework of `array_algos`, I don't see any clearly better method.
Suggestions/ideas that show otherwise are welcome, of course.
To add: the underlying issue is that the functions in `masked_reductions.py` only return scalar `NA` values. This means that we don't get the type information when doing reductions that return `NA`, but have to infer it, and the inference is complex because it depends on the calling method, the dtype of the calling data, and the OS platform.

The solution would be to return proper 2D results from the `masked_reductions.py` functions, e.g. return a tuple instead of a scalar (see the sketch after this list):

- return `(scalar_value, None)` when returning a scalar
- return `(value_array, mask_array)` when returning a masked array

and then wrap the result in an array in `BaseMaskedArray._wrap_reduction_result` when it should be a masked array.

However, I'm not sure that's the direction we want to pursue, because unless we want to support 2D masked arrays, the current solution could still be simpler than building out 2D reduction support.
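A rough sketch of that tuple-returning idea (illustrative only; `masked_sum` and the exact return convention are assumptions, not pandas' actual API):

```python
import numpy as np

def masked_sum(values: np.ndarray, mask: np.ndarray, *, skipna: bool = True):
    if not skipna and mask.any():
        # The result is NA: return a (value_array, mask_array) pair whose
        # dtype matches what a real sum would produce (an empty sum is cheap
        # and yields the same promoted dtype), instead of a bare scalar NA.
        return np.array([values[:0].sum()]), np.ones(1, dtype=bool)
    # The result is a valid scalar: signal "no mask" with None.
    return values[~mask].sum(), None
```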
Longer term, that indeed sounds like a good alternative, even for the current 1D-reshaped-as-2D case. One potential disadvantage is that you still need to do some actual calculation on the data, unnecessarily, to get the `value_array` return value (so we can use it to determine the result dtype). Calculating the actual value is "unnecessary" if you know the result will be masked. Of course, it avoids hardcoding result dtypes. But for the case of skipna=False with a large array, might that introduce a slowdown?
We only "know" because of the hard-coding here, for example Series([1, 2, pd.NA], dtype="int8").sum()
is Int32
on some systems and Int64
on others. So to avoid the hardcoding we will have to do a calculation, so it's a trade-off (or we could hard-code the dtype when we're 100 % sure we get a NA, but that'll be about the same complexity as now + the extra code for doing real 2d ops). I think we should only do the 2d ops if we want it independent of this issue here.
IMO that's a possible future PR, if we choose to go in that direction.
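The platform dependence referenced here comes from NumPy's accumulator choice for small integer dtypes:

```python
import numpy as np

# np.sum accumulates small integer dtypes in the platform default int:
# 64-bit on most Linux/macOS builds, 32-bit on Windows and 32-bit systems
# (as of NumPy 1.x, the version in use at the time of this PR).
print(np.array([1, 2], dtype="int8").sum().dtype)  # int64 here; int32 on Windows
```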
We only "know" because of the hard-coding here
In the end, also numpy kind of "hard codes" this, but just in the implementation itself. And we get the difficulty of combining the algo of numpy with our own logic, where we also want to know this information. So I don't think it's necessarily "wrong" to hard code this on our side as well (i.e. essentially keep it as you are doing in this PR, also long term (at least as long as the arrays are only 1D)).
Sidenote, from seeing the additional complexity for 32bit systems in deciding the result dtype, I do wonder if we actually want to get rid of that system-dependent behaviour? (we also don't follow numpy's behaviour in the constructors, but always default to int64)
Although given that we still call numpy for the actual operation, it probably doesn't reduce complexity to ensure this dtype guarantee (we would need to move the 32bit check to the implementation of the masked sum)
I'd expect this to be very close to perf-neutral. Where is the slowdown coming from?
It is performance neutral when the DataFrame is not wide, but when the DataFrame gets very wide there will be a certain slowdown, because pandas has to create one new masked array per column and then join those masked arrays. Joining masked arrays is slower than joining the same number of ndarrays, because each masked array holds two ndarrays internally, and the masked arrays are written in Python while the ndarrays are written in C, which adds some overhead for masked arrays. Having said that, note that the ASV slowdowns I show are about joining 100,000 columns, i.e. quite a lot. I think the performance is actually OK, taking the above explanation into account.
…ionArray._reduce_and_wrap
…ionArray._reduce_and_wrap
I haven't re-reviewed in a while, but I was happy here as of the last time I looked at it. If Joris is on board, let's do it.
I've actually just this morning made a new version using … I maintain backward compat in this new version by: …

This is possible because of Lines 10874 to 10895 in f85deab (…). Notice especially the FutureWarning starting on line 10885. This will allow us to not require … Check out the … Thoughts? I prefer this new version, but it's easy to revert back if needed.
Ping...
Looks pretty good. Just a single comment from me: https://github.com/pandas-dev/pandas/pull/52788/files#r1261463068
I've updated, added some …, and fixed some dtype issues with …
["median", Series([2, 2], index=["B", "C"], dtype="Float64")], | ||
["var", Series([2, 2], index=["B", "C"], dtype="Float64")], | ||
["std", Series([2**0.5, 2**0.5], index=["B", "C"], dtype="Float64")], | ||
["skew", Series([pd.NA, pd.NA], index=["B", "C"], dtype="Float64")], |
In a follow-up, if you could add `kurt` (if supported), that would be great.
Thanks, this is an awesome improvement @topper-123!
Nice! Thanks for sticking with this @topper-123.
Yes, but it was worth the slow process. I wouldn't have thought of this way to add a parameter to …
Needed in order to fix broken reduce operations (pandas-dev#52788). Signed-off-by: Michael Tiemann <[email protected]>
Supersedes #52707 and possibly #52261, depending on reception.
This PR improves on #52707 by not attempting to infer the combined dtypes from the original dtypes, and builds heavily on the ideas in #52261, but avoids the overloading of `kwargs` in the ExtensionArray functions/methods in that PR. Instead, this keeps all the work in `ExtensionArray._reduce` by adding an explicit `keepdims` param to that method's signature. This makes this implementation simpler than #52261 IMO.

This PR is not finished, and I would appreciate feedback about the direction here compared to #52261 before I proceed. This currently works correctly for masked arrays & ArrowArrays (AFAICS), but the other ExtensionArray subclasses (datetimelike arrays & Categorical) still need to be done. I've written some tests, but need to write more, plus write the docs.
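A minimal sketch of the `keepdims` contract described above (requires a pandas version that includes this PR; `_reduce` is a private ExtensionArray method, shown here only for illustration):

```python
import pandas as pd

arr = pd.array([1, 2, pd.NA], dtype="Int64")

# keepdims=False (the old behavior): a scalar, so the masked dtype is lost.
print(arr._reduce("sum", skipna=True))                 # 3

# keepdims=True: a length-1 IntegerArray preserving dtype information, so
# DataFrame reductions can concatenate per-column results.
print(arr._reduce("sum", skipna=True, keepdims=True))  # [3], dtype: Int64
```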
CC: @jbrockmendel & @rhshadrach